Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move array ArrayAgg to a UserDefinedAggregate #11448

Merged
merged 12 commits into from
Jul 19, 2024

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Jul 13, 2024

Which issue does this PR close?

Closes #.
Part of #10999, #8708

TODO

  • OrderSensitiveArrayAgg

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate labels Jul 13, 2024
@jayzhan211 jayzhan211 changed the title Convert ArrayAgg to UDAF ArrayAgg UDAF Jul 13, 2024
eejbyfeldt and others added 9 commits July 17, 2024 13:35
This follows how it done for input_type and only provide a single value.
But might need to be changed into a Vec in the future.

This is need when we are moving `arrag_agg` to udaf where one of the
states nullability will depend on the nullability of the input.
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
}

#[cfg(test)]
mod tests {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test for distinct array agg is already coverdd in

# test array_agg_distinct with list data type
statement ok
CREATE TABLE array_agg_distinct_list_table AS VALUES
('w', [0,1]),
('w', [0,1]),
('w', [1,0]),
('b', [1,0]),
('b', [1,0]),
('b', [1,0]),
('b', [0,1])
;
# Apply array_sort to have deterministic result, higher dimension nested array also works but not for array sort,
# so they are covered in `datafusion/physical-expr/src/aggregate/array_agg_distinct.rs`
query ??
select array_sort(c1), array_sort(c2) from (
select array_agg(distinct column1) as c1, array_agg(distinct column2) as c2 from array_agg_distinct_list_table
);
----
[b, w] [[0, 1], [1, 0]]

Remove the test here doesn't reduce test coverage

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(though it looks like the comments should be adjusted):

# so they are covered in `datafusion/physical-expr/src/aggregate/array_agg_distinct.rs`

@@ -702,6 +702,7 @@ async fn roundtrip_expr_api() -> Result<()> {
string_agg(col("a").cast_to(&DataType::Utf8, &schema)?, lit("|")),
bool_and(lit(true)),
bool_or(lit(true)),
array_agg(lit(1)),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add array_agg_distinct since it is not exist, and we can get distinct one with builder mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is builder mode ? You mean a builder for array_agg

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @jayzhan211 means that there is a single array_agg udf function now, and internally it has special accumulators for distinct and non distinct

Thus there is no difference in how to serialize it I think

Maybe we could use the builder to test

array_agg(lit(1)).distinct().build().unwrap(),

🤔

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@@ -63,160 +61,3 @@ pub use negative::{negative, NegativeExpr};
pub use no_op::NoOp;
pub use not::{not, NotExpr};
pub use try_cast::{try_cast, TryCastExpr};

#[cfg(test)]
pub(crate) mod tests {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no usage

@jayzhan211 jayzhan211 marked this pull request as ready for review July 17, 2024 08:12
@jayzhan211 jayzhan211 requested a review from alamb July 18, 2024 07:36

// TODO: change name to lowercase
fn name(&self) -> &str {
"ARRAY_AGG"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any specific reason for having this in uppercase ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly it is inconsistent with count which uses lower case:

fn name(&self) -> &str {
"count"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why they are upper case initially, however we plan to convert them all to lowercase

if values.is_empty() {
return Ok(());
}
assert!(values.len() == 1, "array_agg can only take 1 param!");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will invoke the panic!, should we return an Err here ?

if states.is_empty() {
return Ok(());
}
assert!(states.len() == 1, "array_agg states must be singleton!");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. Can we remove all assert! and return Err

@@ -702,6 +702,7 @@ async fn roundtrip_expr_api() -> Result<()> {
string_agg(col("a").cast_to(&DataType::Utf8, &schema)?, lit("|")),
bool_and(lit(true)),
bool_or(lit(true)),
array_agg(lit(1)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is builder mode ? You mean a builder for array_agg

distinct = true;
protobuf::AggregateFunction::ArrayAgg
} else if aggr_expr.downcast_ref::<OrderSensitiveArrayAgg>().is_some() {
// TODO: remove
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think it can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OrderSensitiveArrayAgg is not yet removed

fn default() -> Self {
Self {
signature: Signature::any(1, Volatility::Immutable),
alias: vec!["array_agg".to_string()],
Copy link
Contributor

@dharanad dharanad Jul 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we using the smallcase as alias ? Shouldn't this be the function name itself ?

ArrayAgg,
array_agg,
expression,
"input values, including nulls, concatenated into an array",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not clear & can we improve for clarity just the other functions ?

@@ -171,6 +171,7 @@ pub fn max(expr: Expr) -> Expr {
))
}

// TODO: remove
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think we can remove this

Ok(())
}

// Append value like ListArray(Int64Array(1,2,3), Int64Array(4,5,6))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: If this is code document, then we can move this inside the function.

@alamb alamb changed the title ArrayAgg UDAF Move array ArrayAgg to a UserDefinedAggregate Jul 18, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211 and @dharanad for the review

I think this PR is good enough to merge, though I think addressing @dharanad 's comments beforehand would make it even better.

We are so close to completing user defined aggregate migration. so close!

// specific language governing permissions and limitations
// under the License.

//! Defines physical expressions that can evaluated at runtime during query execution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! Defines physical expressions that can evaluated at runtime during query execution
//! `ARRAY_AGG` aggregate implementation: [`ArrayAgg`]


// TODO: change name to lowercase
fn name(&self) -> &str {
"ARRAY_AGG"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly it is inconsistent with count which uses lower case:

fn name(&self) -> &str {
"count"

}

#[cfg(test)]
mod tests {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(though it looks like the comments should be adjusted):

# so they are covered in `datafusion/physical-expr/src/aggregate/array_agg_distinct.rs`

@@ -702,6 +702,7 @@ async fn roundtrip_expr_api() -> Result<()> {
string_agg(col("a").cast_to(&DataType::Utf8, &schema)?, lit("|")),
bool_and(lit(true)),
bool_or(lit(true)),
array_agg(lit(1)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @jayzhan211 means that there is a single array_agg udf function now, and internally it has special accumulators for distinct and non distinct

Thus there is no difference in how to serialize it I think

Maybe we could use the builder to test

array_agg(lit(1)).distinct().build().unwrap(),

🤔

Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 19, 2024
@jayzhan211
Copy link
Contributor Author

Thanks @dharanad and @alamb

@jayzhan211 jayzhan211 merged commit 12d82c4 into apache:main Jul 19, 2024
23 checks passed
@jayzhan211 jayzhan211 deleted the array-agg-udf branch July 19, 2024 03:59
@@ -702,6 +702,8 @@ async fn roundtrip_expr_api() -> Result<()> {
string_agg(col("a").cast_to(&DataType::Utf8, &schema)?, lit("|")),
bool_and(lit(true)),
bool_or(lit(true)),
array_agg(lit(1)),
array_agg(lit(1)).distinct().build().unwrap(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 23, 2024
* Add input_nullable to UDAF args StateField/AccumulatorArgs

This follows how it done for input_type and only provide a single value.
But might need to be changed into a Vec in the future.

This is need when we are moving `arrag_agg` to udaf where one of the
states nullability will depend on the nullability of the input.

* Make ArragAgg (not ordered or distinct) into a UDAF

* Add roundtrip_expr_api test case

* Address PR comments

* Propegate input nullability for aggregates

* Remove from accumulator args

* first draft

Signed-off-by: jayzhan211 <[email protected]>

* cleanup

Signed-off-by: jayzhan211 <[email protected]>

* fix test

Signed-off-by: jayzhan211 <[email protected]>

* distinct

Signed-off-by: jayzhan211 <[email protected]>

* fix

Signed-off-by: jayzhan211 <[email protected]>

* address comment

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>
Co-authored-by: Emil Ejbyfeldt <[email protected]>
wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024
* Add input_nullable to UDAF args StateField/AccumulatorArgs

This follows how it done for input_type and only provide a single value.
But might need to be changed into a Vec in the future.

This is need when we are moving `arrag_agg` to udaf where one of the
states nullability will depend on the nullability of the input.

* Make ArragAgg (not ordered or distinct) into a UDAF

* Add roundtrip_expr_api test case

* Address PR comments

* Propegate input nullability for aggregates

* Remove from accumulator args

* first draft

Signed-off-by: jayzhan211 <[email protected]>

* cleanup

Signed-off-by: jayzhan211 <[email protected]>

* fix test

Signed-off-by: jayzhan211 <[email protected]>

* distinct

Signed-off-by: jayzhan211 <[email protected]>

* fix

Signed-off-by: jayzhan211 <[email protected]>

* address comment

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>
Co-authored-by: Emil Ejbyfeldt <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants